Hindi-English Language Identification, Named Entity Recognition and Back Transliteration: Shared Task System Description
Abstract
This paper presents an algorithm for word-level language identification, named entity recognition and classification, and transliteration of Indian language words written in the Roman script to their native Devanagari script from bilingual textual data. We propose the construction of an extensive, hierarchically structured dictionary and a hierarchical rule-based classifier to expedite word search and language identification. The proposed method uses lexical, contextual and special-character features particular to Hindi and English. With a few modifications, the system can be adapted to other languages. The system we have submitted shows the best performance in English token-level precision (0.895) and the second best in Indian language token recall (0.915). The transliteration F-measure is relatively low (0.15); this could be improved significantly with more representative and exhaustive training data.
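As a rough illustration of the dictionary-plus-rules idea in the abstract, the following sketch stores a word list in a trie (one possible reading of a "hierarchically structured dictionary") and tags each token through a small rule cascade. The abstract does not specify the actual data structure, rules or label set, so the class names, the labels (`hi`, `en`, `other`, `unknown`) and the toy word list are all assumptions made for this example, not the authors' system.

```python
from typing import Dict, List, Optional


class TrieNode:
    """One character level of the lexicon (hypothetical structure)."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.lang: Optional[str] = None  # set only where a word ends


class LexiconTrie:
    """Prefix tree mapping lowercased words to a language label."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, word: str, lang: str) -> None:
        node = self.root
        for ch in word.lower():
            node = node.children.setdefault(ch, TrieNode())
        node.lang = lang

    def lookup(self, word: str) -> Optional[str]:
        node = self.root
        for ch in word.lower():
            nxt = node.children.get(ch)
            if nxt is None:
                return None  # stop as soon as the prefix is unknown
            node = nxt
        return node.lang


def identify(token: str, lexicon: LexiconTrie) -> str:
    """Rule cascade (assumed): special characters, then lexicon, then fallback."""
    if not any(c.isalpha() for c in token):
        return "other"  # digits, punctuation, emoticons
    lang = lexicon.lookup(token)
    return lang if lang is not None else "unknown"


def tag_tokens(tokens: List[str], lexicon: LexiconTrie) -> List[str]:
    return [identify(t, lexicon) for t in tokens]


# Toy lexicon; a real system would load large Hindi and English word lists.
lexicon = LexiconTrie()
for w in ("mera", "ghar"):
    lexicon.add(w, "hi")
for w in ("house", "is", "big"):
    lexicon.add(w, "en")

labels = tag_tokens(["mera", "ghar", "is", "big", "!!"], lexicon)
print(labels)  # ['hi', 'hi', 'en', 'en', 'other']
```

A trie lets a lookup stop at the first character that matches no known prefix, which is one way a hierarchical dictionary could speed up word search. Contextual features, such as the labels of neighbouring tokens, would be added as further rules after the lexicon step.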
Related resources
LIGA and Syllabification Approach for Language Identification and Back Transliteration: Shared Task Report by DAIICT
This paper addresses Subtask 1 of the Shared Task on Transliterated Search, a task in FIRE ’14. The task concerns data containing English words together with Indian language words transliterated into the Roman script. It calls for language identification and subsequent back transliteration into the native Indian scripts. The system proposed herewith implements Language ...
Named Entity Recognition in Hindi using Maximum Entropy and Transliteration
(NER) system becomes challenging if proper resources are not available. Gazetteer lists are often used for the development of NER systems. In many resource-poor languages gazetteer lists of proper size are not available, but sometimes relevant lists are available in English. Proper transliteration makes the English lists useful in the NER tasks for such languages. In this paper, we have describ...
NEWS 2009 Machine Transliteration Shared Task System Description: Transliteration with Letter-to-Phoneme Technology
We interpret the problem of transliterating English named entities into Hindi or Japanese Katakana as a variant of the letter-to-phoneme (L2P) subtask of text-to-speech processing. Therefore, we apply a re-implementation of a state-of-the-art, discriminative L2P system (Jiampojamarn et al., 2008) to the problem, without further modification. In doing so, we hope to provide a baseline for the NEW...
A Hybrid Approach of English-Hindi Named-entity Transliteration
In recent years, machine transliteration has become a focus of research. Both machine translation and transliteration are important for e-governance and web-based online multilingual applications. Machine translation translates the source language into the target language, which results in incorrect translations of named entities. Named entities are required to be translated with preserving t...
CRF-based Named Entity Recognition @ICON 2013
This paper describes the performance of CRF-based systems for Named Entity Recognition (NER) in Indian languages as part of the ICON 2013 shared task. In this task we have considered a set of language-independent features for all the languages. Only for English, a language-specific feature, i.e. capitalization, has been added. Next, the use of gazetteers is explored for Bengali, Hindi and English. The ga...